The effect of data resampling methods in radiomics

Radiomic datasets can be class-imbalanced, for instance, when the prevalence of diseases varies notably, meaning that the number of positive samples is much smaller than that of negative samples. In these cases, the majority class may dominate the model's training and thus negatively affect the model's predictive performance, leading to bias. Therefore, resampling methods are often utilized to class-balance the data. However, several resampling methods exist, and neither their relative predictive performance nor their impact on feature selection has been systematically analyzed. In this study, we aimed to measure the impact of nine resampling methods on radiomic models utilizing a set of fifteen publicly available datasets regarding their predictive performance. Furthermore, we evaluated the agreement and similarity of the set of selected features. Our results show that applying resampling methods did not improve the predictive performance on average. On specific datasets, slight improvements in predictive performance (+ 0.015 in AUC) could be seen. A considerable disagreement on the set of selected features was seen (only 28.7% of features agreed), which strongly impedes feature interpretability. However, selected features are similar when considering their correlation (82.9% of features correlated on average).


Feature agreement and similarity
Resampling changed the features selected by the models that performed best in terms of AUC (Fig. 4). Using the Jaccard index, the sets of selected features agreed by only 28.7% on average. The highest agreement between any oversampling method and no resampling at all was for random oversampling, with an agreement of 40%. Using the Ochiai index did not alter these results substantially. On average, a higher agreement of around 38.5% was seen (Fig. S1 in Additional file 1). The largest feature agreement with no resampling was again observed for random oversampling.
Feature similarity was higher and amounted to 86.9% (Fig. 5). Overall, it seemed that the worse a resampling method performed relative to the others, the less similar the selected features were. Using the Zucknick measure led to lower feature similarities (Fig. S2 in Additional file 1).

Figure 1.
Relative predictive performance of the resampling methods. Mean rank and mean gain in AUC, sensitivity and specificity of all resampling methods across all datasets, compared to not resampling (None). Maximum gain in AUC, sensitivity, and specificity denotes the largest difference seen in any of the datasets. Oversampling methods are denoted in blue, undersampling methods in red, and combined methods in green.

Discussion
Resampling methods have often been applied in radiomics, with the promise of improving the predictive performance if the data is unbalanced. In this study, we estimated the impact of different resampling methods on the predictive performance and the selected features across multiple datasets.
Regarding the predictive performance, virtually no improvement was seen compared to not resampling. Even worse, applying undersampling decreased the performance on average. On specific datasets, however, a slight increase of up to + 0.015 in AUC could be seen for SMOTE, showing that when only a single dataset is considered, one method can outperform every other. Yet, these observations do not generalize since SMOTE performed worse on other datasets. The same is also true for the models' sensitivity and specificity. While, on average, no improvement compared to not resampling was seen, a higher sensitivity (up to + 0.055) and specificity (up to + 0.032) could be observed on specific datasets. Again, this was dependent on the dataset; it also means that the respective models performed worse in terms of their sensitivity and specificity on other datasets.
The situation is more complicated when comparing the agreement of the sets of selected features of the best-performing models. Even if two different resampling methods resulted in similarly performing models, this does not entail that the same number or the same set of features was chosen. On average, less than one-third of the selected features agreed, which shows that if one were to equate the selected features with biomarkers, no agreement could be reached when models were trained using different resampling methods. Using the Ochiai index instead of the more commonly used Jaccard index to measure the agreement did not change this picture, although the agreement was higher (by around 10 percentage points).
However, the picture changed when similarity was considered: On average, each feature selected by the best-performing model for one resampling method correlated highly with a feature selected by another resampling method. This is partially an effect of the high correlation present in radiomic datasets 16: Resampling can change the statistical distribution of the features so that the feature selection identifies other features as relevant, but it seems to select highly correlated features, i.e., those that contain similar information. We have employed our own measure for feature similarity since no universally accepted metric exists 17. This measure intuitively captures the average of the highest correlations between the two feature sets. Alternatively, we have employed the Zucknick measure, which can be understood as a variant of the Jaccard index considering feature similarity. Using this measure, however, led to lower feature similarities. One reason for this difference is that the Zucknick measure takes the number of selected features into account, which our measure does not. The Zucknick measure can consequently lead to unexpected results when features are duplicated in the feature sets. Therefore, we believe our measure is more appropriate for measuring feature similarity.
Together with the fact that the predictive performance did not improve, this indicates that resampling does not help in radiomics as much as one would hope.
Given the relatively large number of radiomic studies utilizing SMOTE or other resampling methods, our result is surprising. However, Blagus and Lusa analyzed SMOTE on three high-dimensional genetic datasets and concluded that it had no measurable effect on high-dimensional datasets and that undersampling is preferable to SMOTE 18. Our results confirmed these observations partly. Indeed, none of the resampling methods improved the overall predictive performance, yet SMOTE and its variants did not result in a drop in predictive performance.
Our results could also point towards a publication bias: If resampling did not improve predictive performance, it might have been dropped from reporting, leaving only those studies where resampling did help. Arguably, this would hurt radiomic research from a scientific viewpoint 19. Another bleak explanation could be that some studies did not apply the resampling correctly. If only cross-validation is used without an independent test set, it is of utmost importance that resampling is applied only to the training set and does not utilize the validation set in any way 20,21. If this is not followed, a large bias can be expected 22,23; yet, this kind of error is common 24 and often cannot be detected without access to the code, which is most often not provided in radiomic studies. There is also the possibility that the setup of these studies differed in some ways from ours; for example, we only used filter-based feature selection methods and simpler classifiers. More intricate wrapper methods like SVM-RFE combined with more complex classifiers like XGBoost might perform better on specific datasets. However, we followed the usual radiomic pipeline rather strictly and employed the most often-used methods.
Other studies partly confirm our results. Sarac and Guvenis considered six different resampling methods in a cohort of patients with oropharyngeal cancer to determine their HPV status 25. They demonstrated that oversampling performed better overall than undersampling, with SMOTE obtaining the highest performance. In our experiments, oversampling also performed better on average than undersampling, and SMOTE performed relatively well, but not resampling at all performed best. Unfortunately, Sarac and Guvenis did not consider this. In a similar study, Zhang et al. tested four subsampling methods in a cohort of patients with non-small-cell lung cancer (NSCLC) and reported that applying SMOTE improved the performance significantly, with an improvement of + 0.03 in AUC 26, while other resampling methods seemed to fare less well. However, it must be noted that they extracted only 30 features, which makes their data low-dimensional (more samples than features). Therefore, the improvement might be larger than we observed on a single dataset. In a recent large benchmarking study on non-radiomics datasets, Tarawneh et al. concluded that many resampling methods are not helpful 27. This result is in line with our study, even though they employed only low-dimensional datasets with very high imbalance, which is uncommon in radiomics.
In our study, we applied resampling before feature selection, following the observations from Blagus and Lusa 18. Yet, resampling can be used before or after feature selection, and there are arguments for both choices. Since feature selection methods might be affected strongly by imbalance, applying resampling beforehand might be more beneficial 18. However, using it after feature selection also has some advantages: As the dataset is slimmed down by feature selection, the resampling approach will be computationally more efficient and will not resample otherwise irrelevant features. Yet, the situation is not clear: In a recent study on high-dimensional genetic data, Ramos-Pérez et al. 28 demonstrated that the order of resampling and feature selection could depend on the resampling method. They state that random undersampling (RUS) should ideally be performed before feature selection, but random oversampling (ROS) and SMOTE afterward. This result was not observed in our study, where SMOTE applied upfront outperformed RUS. Accordingly, when confronted with a new dataset, both variants should be tested if predictive performance is the goal.
Some limitations apply to our study. First, although we have employed a rather large collection of datasets, these were collected opportunistically. We cannot exclude that a potential bias might be present. Furthermore, we could not use external data since there are only very few publicly available datasets where such data is provided. We cannot rule out that resampling methods could help the model to generalize better to external data 29. Instead, we utilized cross-validation, which can measure robustness with respect to different distributions only in a limited way. In addition, cross-validation could lead to some overfitting. However, this would possibly affect all methods by a similar amount. Due to restricted computational resources, we opted for a fivefold cross-validation with 30 repeats, although we acknowledge that other validation schemes like leave-one-out CV or using a higher number of repeats could allow for more precise results. In addition, although we tested the most commonly used resampling methods, many more have been developed, especially methods based on generative adversarial networks, which are promising 30.
The same applies to the feature selection methods and classifiers we employed in this study. We also only considered generic features, that is, those based on morphological, intensity, and textural features, and did not employ datasets with features extracted from deep neural networks 31–33. Since these features might be quantitatively different, our conclusions might not hold for these datasets. Furthermore, we did not apply feature reduction, like principal component analysis, because these methods generate new features which usually do not have a direct interpretation. However, since features are thought to correlate to biomarkers, their interpretation is critical in radiomics. Also, our study only considered AUC as the primary metric for predictive performance and considered sensitivity and specificity as secondary metrics. Depending on the problem, other metrics can be more important. Our study cannot estimate the effect of resampling on these metrics. However, AUC is arguably the most essential metric since it can be considered the de facto metric for radiomic studies 34,35.
Our study demonstrated that, on average, resampling methods did not improve the overall predictive performance of models in radiomics, although this might be the case for a specific dataset. Applying resampling largely changed the set of selected features, which obstructs feature interpretation. However, the sets of features were highly correlated, indicating that resampling does not change the information in the data by much.

Methods
In this study, we utilized previously published and publicly accessible datasets to ensure reproducibility. The corresponding ethical review boards granted ethical approval for these datasets. Since the study was retrospective, the local Ethics Committee (Ethik-Kommission, Medizinische Fakultät der Universität Duisburg-Essen, Germany) waived the need for additional ethical approval. The study was conducted following relevant guidelines and regulations.

Datasets
We collected a total of 15 publicly available radiomic datasets (Table 1). These datasets were not collected systematically but were gathered opportunistically, reflecting the scarcity of relevant data in the field 36. All datasets comprised already extracted radiomics features; no feature generation was performed for this study. Each dataset was prepared by removing non-radiomic features (like clinical or genetic data) before merging all data splits. The datasets were all high-dimensional, meaning they had more features than samples, except for two datasets (Carvalho2018 and Saha2018).

Preprocessing
A few missing values were observed in the datasets, notably in the three datasets by Hosny et al. Here, at most 0.79%, 0.65% and 0.19% of the values were missing. However, these missing values nearly exclusively affected two features, 'exponential_ngtdm_contrast' and 'exponential_glcm_correlation'; they possibly occurred because of numerical overflows due to the exponential function. These two features were consequently removed from the analysis. Other datasets had less than 0.16% missing values. Due to how radiomic features are computed, these values were likely missing completely at random and did not lead to systematic bias. The missing values were imputed with the feature means. All datasets were then normalized by z-score, i.e., by subtracting the mean of each feature and dividing by the standard deviation.
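The imputation and normalization steps can be illustrated with a minimal sketch in plain Python. The function name `impute_and_standardize`, the use of `None` to mark missing values, and the choice of the population standard deviation are assumptions made for illustration; the text does not state which variant was used or whether the statistics were computed on the training data only.

```python
import math


def impute_and_standardize(rows):
    """Mean-impute missing values (None) per feature, then z-score each feature.

    `rows` is a list of samples, each a list of feature values.
    Assumption: the population standard deviation is used.
    """
    n_features = len(rows[0])
    result = [list(r) for r in rows]
    for j in range(n_features):
        observed = [r[j] for r in rows if r[j] is not None]
        mean = sum(observed) / len(observed)
        # Replace missing entries with the mean of the observed values.
        column = [mean if r[j] is None else r[j] for r in rows]
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
        for i, v in enumerate(column):
            # Constant features cannot be scaled and are set to zero.
            result[i][j] = (v - mean) / std if std > 0 else 0.0
    return result


data = [[1.0, 10.0], [None, 20.0], [3.0, 30.0]]
normalized = impute_and_standardize(data)
```

Because the imputed entries equal the feature mean, they map to exactly zero after z-scoring.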

Resampling methods
Nine different resampling methods were used in this study, encompassing over- and undersampling and combination techniques (Table 2). The undersampling methods, which were random undersampling, edited nearest neighbors (ENN), all k-NN, and Tomek links, aim to reduce the size of the majority class to match that of the minority class. In contrast, the oversampling methods, random oversampling, synthetic minority oversampling technique (SMOTE), and SVM-SMOTE, aim to increase the size of the minority class to match that of the majority class. Combination methods, like SMOTE + ENN and SMOTE + Tomek, involve resampling both classes, usually resulting in datasets where the majority class is smaller than before and the minority class is larger. Some resampling methods require a choice of the neighborhood size to consider during the resampling. In the original SMOTE study 37, the neighborhood size was set to 5. Since this choice might not be optimal for all 15 datasets, a smaller and a larger size were also considered, i.e., the neighborhood size was chosen from 3, 5, and 7. However, in a few datasets, a size of 5 or 7 for undersampling methods effectively removed the minority class; therefore, in ENN and SMOTE + ENN, only a neighborhood of size 3 was used.
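To make the two simplest of these methods concrete, the following sketch implements random oversampling and random undersampling for a binary problem in plain Python. It is only an illustration: the helper `random_resample` and its parameters are hypothetical names, and the neighborhood-based methods (SMOTE, ENN, Tomek links, and so on) are not covered here.

```python
import random
from collections import Counter


def random_resample(X, y, mode="over", seed=0):
    """Balance a binary dataset by random over- or undersampling.

    mode="over": duplicate random minority samples up to the majority size.
    mode="under": keep a random majority subset of the minority size.
    """
    rng = random.Random(seed)
    ranked = Counter(y).most_common()  # sorted by class size, largest first
    (major, n_major), (minor, n_minor) = ranked[0], ranked[-1]
    idx_major = [i for i, label in enumerate(y) if label == major]
    idx_minor = [i for i, label in enumerate(y) if label == minor]
    if mode == "over":
        extra = [rng.choice(idx_minor) for _ in range(n_major - n_minor)]
        keep = list(range(len(y))) + extra
    elif mode == "under":
        keep = idx_minor + rng.sample(idx_major, n_minor)
    else:
        raise ValueError("mode must be 'over' or 'under'")
    return [X[i] for i in keep], [y[i] for i in keep]


X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
X_over, y_over = random_resample(X, y, mode="over")
X_under, y_under = random_resample(X, y, mode="under")
```

With eight negative and two positive samples, oversampling yields eight samples per class and undersampling yields two per class.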

Feature selection
For the selection of relevant features, four often-used feature selection methods were used 38: Analysis of variance (ANOVA), Bhattacharyya scores, Extra trees (ET), and the least absolute shrinkage and selection operator (LASSO). Being filter methods, each of them scored the features according to their estimated relevance. The highest-scoring features were then extracted based on a choice of how many features should be included. Here, the number of selected features was chosen on a logarithmic scale among N = 1, 2, 4, …, 32, 64. This approach allowed for efficient exploration while maintaining low computational complexity.
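As an illustration of how such a filter method works, the sketch below scores each feature with a one-way ANOVA F-statistic and keeps the N highest-scoring ones. The function names `anova_f_scores` and `select_top` are hypothetical, and the other three scoring methods would replace only the scoring step.

```python
def anova_f_scores(X, y):
    """One-way ANOVA F-statistic of every feature with respect to the labels."""
    labels = sorted(set(y))
    n, d = len(X), len(X[0])
    scores = []
    for j in range(d):
        column = [row[j] for row in X]
        grand_mean = sum(column) / n
        ss_between = ss_within = 0.0
        for label in labels:
            group = [v for v, t in zip(column, y) if t == label]
            group_mean = sum(group) / len(group)
            ss_between += len(group) * (group_mean - grand_mean) ** 2
            ss_within += sum((v - group_mean) ** 2 for v in group)
        df_between, df_within = len(labels) - 1, n - len(labels)
        if ss_within > 0:
            scores.append((ss_between / df_between) / (ss_within / df_within))
        else:
            # No within-class variance: treat the feature as maximally relevant.
            scores.append(float("inf"))
    return scores


def select_top(scores, n_select):
    """Indices of the n_select highest-scoring features."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:n_select]


# Candidate numbers of selected features on a logarithmic scale: 1, 2, 4, ..., 64.
FEATURE_COUNTS = [2 ** k for k in range(7)]

X = [[0.1, 1.0], [0.9, 1.1], [0.2, 5.0], [0.8, 5.1]]
y = [0, 0, 1, 1]
best = select_top(anova_f_scores(X, y), n_select=1)
```

In this toy example only the second feature separates the classes, so it is the one selected.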

Classifiers
Models were trained using often-used classifiers 39: k-nearest neighbor (kNN), logistic regression (LR), naive Bayes, random forest (RF), and kernelized SVM (RBF-SVM). Some of these methods have hyperparameters; e.g., the performance of the RBF-SVM is known to depend strongly on the choice of the regularization parameter C 40. This parameter was therefore optimized using a simple grid search on the training data 41,42, during which it was selected from $2^{-10}, 2^{-8}, \ldots, 2^{-1}, 1, 2^{2}, \ldots, 2^{8}, 2^{10}$. The kernel width γ of the RBF-SVM was set to the inverse of the mean distance between any two samples. For the RF, the number of trees was set to 250. The neighborhood size of the kNN was chosen among 1, 3, 5, 7, 9. Finally, the regularization parameter of the logistic regression was also chosen from $2^{-10}, 2^{-8}, \ldots, 2^{-1}, 1, 2^{2}, \ldots, 2^{8}, 2^{10}$. Other parameters were left at their default values.
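The heuristic for the kernel width can be written down in a few lines. The sketch below assumes the Euclidean distance, which the text does not specify, and the function name `rbf_gamma` is hypothetical.

```python
import math
from itertools import combinations


def rbf_gamma(X):
    """Inverse of the mean pairwise distance between samples.

    Assumption: Euclidean distance over all unordered sample pairs.
    """
    distances = [math.dist(a, b) for a, b in combinations(X, 2)]
    return 1.0 / (sum(distances) / len(distances))


samples = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
gamma = rbf_gamma(samples)  # pairwise distances are 5, 10 and 5
```

Setting γ from the data in this way removes one hyperparameter from the grid search, so only C has to be tuned for the RBF-SVM.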

Training
The evaluation followed the standard radiomics pipeline 43,44 and was performed using a fivefold stratified cross-validation (CV) with 30 repeats (Fig. 6). Stratification was employed to ensure that the original class balance of the data is kept in the test folds as well. In each repeat, the data was first split into five folds. In turn, each fold was left out once for validation, while the other four folds were used as a training set. The training set was then resampled using one of the resampling methods. A feature selection method and a classifier were subsequently applied to the resulting data. The final model was then evaluated on the validation fold, i.e., the validation fold was first reduced to the features selected on the training data, and then the classifier made its predictions.
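The splitting scheme can be sketched in plain Python as follows. This is a simplified stand-in rather than the code used in the study: `stratified_folds` and `cv_splits` are hypothetical helpers, and the comments mark where resampling, feature selection, and training would take place so that the validation fold is never touched by the resampling.

```python
import random
from collections import defaultdict


def stratified_folds(y, n_folds=5, seed=0):
    """Distribute sample indices over folds so that each fold keeps the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % n_folds].append(i)
    return folds


def cv_splits(y, n_folds=5, n_repeats=30):
    """Yield (train, validation) index lists for repeated stratified CV."""
    for repeat in range(n_repeats):
        folds = stratified_folds(y, n_folds, seed=repeat)
        for k in range(n_folds):
            validation = folds[k]
            train = [i for f, fold in enumerate(folds) if f != k for i in fold]
            yield train, validation


labels = [0] * 20 + [1] * 5
for train, validation in cv_splits(labels, n_folds=5, n_repeats=2):
    # 1. Resample ONLY the training samples (indices in `train`).
    # 2. Select features and fit the classifier on the resampled training data.
    # 3. Evaluate on the untouched validation samples (indices in `validation`).
    pass
```

Keeping the resampling inside the loop, after the split, is what prevents information from the validation fold from leaking into the training data.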

Predictive performance
Since the primary focus in radiomics is obtaining accurate predictions, the macro-averaged area under the receiver operating characteristic curve (AUC) over the five CV validation folds was used to identify the best-performing model. The best-performing models were then analyzed; models performing worse were discarded.
In addition, the sensitivity and the specificity of the models were computed as secondary metrics.
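For reference, these metrics can be computed as in the sketch below. The AUC is obtained from its rank-based definition (the probability that a positive sample receives a higher score than a negative one, with ties counted as one half). Interpreting the macro-average as the plain mean of the per-fold AUCs is an assumption, and all function names are illustrative.

```python
def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for s, t in zip(scores, y_true) if t == 1]
    negatives = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (true positive rate) and specificity (true negative rate)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)


def macro_auc(folds):
    """Mean of the per-fold AUCs; `folds` is a list of (y_true, scores) pairs."""
    return sum(auc(y, s) for y, s in folds) / len(folds)


fold_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs are ordered correctly
```

Both classes must be present in every validation fold for these quantities to be defined, which the stratification ensures.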

Feature agreement and similarity
The agreement of the selected features was compared pairwise for all resampling methods over each training fold during the CV. We used the Jaccard index, also called Intersection-over-Union, to measure agreement. Since no universal metric exists, we also employed the Ochiai index 45.
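Both indices follow directly from their standard set-based definitions, as the short sketch below shows. The feature names in the example are made up, and the handling of empty sets is an arbitrary convention chosen here.

```python
import math


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def ochiai(a, b):
    """Ochiai index: intersection size over the geometric mean of the set sizes."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


set_a = {"f1", "f2", "f3", "f4"}
set_b = {"f3", "f4", "f5"}
agreement_jaccard = jaccard(set_a, set_b)  # 2 shared features out of 5 in total
agreement_ochiai = ochiai(set_a, set_b)    # 2 / sqrt(4 * 3)
```

Since the geometric mean of the two set sizes never exceeds the size of their union, the Ochiai index is always at least as large as the Jaccard index for the same pair of sets.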
Since radiomic datasets are known to be highly correlated 16, two sets of features might look vastly different, although they might describe similar information. Therefore, we computed the similarity between the sets of selected features. It is calculated roughly as follows: First, for each feature in one set, the feature with the highest correlation in the other set is identified. The (symmetrized) mean over all these correlations is then defined as the similarity. More information can be found in Additional file 1. Since this is an ad hoc metric, we also computed the Zucknick measure 46, which can be understood as a correlation-corrected version of the Jaccard index 45.
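One possible reading of this similarity measure is sketched below. Several details are assumptions, since the exact definition is given only in Additional file 1: the sketch uses the absolute Pearson correlation and symmetrizes by averaging the two directed means. The names `pearson`, `directed_similarity`, and `feature_similarity` are illustrative.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equally long value lists (0.0 if one is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def directed_similarity(columns, set_a, set_b):
    """Mean over set_a of the highest absolute correlation with any feature in set_b."""
    best = [max(abs(pearson(columns[i], columns[j])) for j in set_b) for i in set_a]
    return sum(best) / len(best)


def feature_similarity(columns, set_a, set_b):
    """Symmetrized similarity: average of both directed similarities (assumption)."""
    forward = directed_similarity(columns, set_a, set_b)
    backward = directed_similarity(columns, set_b, set_a)
    return 0.5 * (forward + backward)


# `columns` maps a feature name to its values over all samples.
columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, 1, -1]}
similar = feature_similarity(columns, ["a"], ["b"])     # close to 1, "b" is a multiple of "a"
dissimilar = feature_similarity(columns, ["a"], ["c"])  # much lower
```

The example shows why agreement and similarity can diverge: the sets {a} and {b} share no feature, so their Jaccard index is zero, yet their similarity is close to one because the two features are perfectly correlated.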

Software
Experiments were implemented using Python 3.10. The resampling methods were taken from the Imbalanced-learn package 0.10.0 47. Our code repository can be found on GitHub, where, apart from the results, a complete list of the dependencies and software versions used in this study is also available.

Statistics
Descriptive statistics were reported using mean and standard deviation. P-values below 0.05 were considered statistically significant. All statistics were computed using Python 3.10. Resampling methods were compared using a Friedman test and a post hoc Nemenyi test 48. The Friedman test was preferred over the ANOVA test since it is non-parametric and thus makes fewer assumptions about the data. Since the Friedman test only tests whether there are any differences between the methods, a pairwise post hoc Nemenyi test was

Figure 2. Pairwise wins and losses for all resampling methods. Wins and losses of all resampling methods. Each row denotes how often the resampling method won against the other methods (column). Draws between resampling methods counted as 0.5. Oversampling methods are denoted in blue, undersampling methods in red, and combined methods in green.

Figure 3. Rankings on each dataset. The rankings on each dataset. Rankings were obtained by sorting the AUCs of the best-performing model. Draws were counted as 0.5. Oversampling methods are denoted in blue, undersampling methods in red, and combined methods in green.

Figure 4. Feature agreement using Jaccard index. Agreement of the set of features selected by the resampling methods. For this, the Jaccard index of the selected features on each fold of the cross-validation was computed and averaged. Oversampling methods are denoted in blue, undersampling methods in red, and combined methods in green.

Figure 6. Flow chart of the experiments.

Table 1.
Overview of the datasets used.Only publicly available datasets were used.N denotes the sample size, d the number of features, N+ the number of positive samples, N− the number of negative samples, B is the ratio of the majority class to the minority class.DOI is the digital object identifier of the publication corresponding to the dataset.

Table 2.
List of resampling methods and parameters.